Document Clustering with Feature Behavior based Distance Analysis

Author

  • A. Kanimozhi
Abstract

Machine learning and data mining methods are applied to analyze large data sets. Clustering methods group related data values, and both partitional and hierarchical clustering methods are used for this purpose. Partitional clustering models process data in tabular form, while hierarchical clustering models organize the data as a tree. Clustering techniques are also applied to group text documents. Distance measures estimate the relationships between documents during clustering; the cosine and Euclidean distance measures are widely used. Dimensionality is the key factor in document clustering. Document contents are parsed and represented in a vector model, in which each feature is assigned an associated weight value. The feature behavior distance model faces high dimensionality and sparsity issues. Feature based similarity estimation is carried out using the Similarity Measurement for Text Process (SMTP), and clustering and classification operations are performed with the SMTP distance measure. Text document clustering is performed using the Hybrid Similarity Measure for Text Process (HSMTP), which integrates feature appearance and weight factors. The HSMTP scheme is integrated with the Spherical K-Means clustering algorithm to partition the documents. A feature reduction process minimizes the dimensionality of the document vector. An ontology is used to fetch the concept relationship values, and the HSMTP scheme also supports a concept relationship based distance model.
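The pipeline outlined in the abstract (weighted document vectors, a distance measure, Spherical K-Means) can be illustrated with a minimal sketch. This is not the HSMTP scheme: the abstract does not give the HSMTP or SMTP formulas, so the sketch assumes TF-IDF weights and plain cosine similarity instead. All function names, the seed-based initialization, and the toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (feature -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def tfidf_vectors(docs):
    """Turn tokenized documents into unit-length TF-IDF vectors.

    Assumption: TF-IDF stands in for the feature weights mentioned in the abstract.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF keeps every weight strictly positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def euclidean(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in keys))


def spherical_kmeans(vectors, seeds, max_iter=20):
    """Spherical k-means with plain cosine similarity (NOT HSMTP).

    `seeds` lists the indices of the documents used as initial centroids,
    which keeps the sketch deterministic.
    """
    centroids = [dict(vectors[i]) for i in seeds]
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assign each document to the centroid with the highest cosine similarity.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the normalized sum of its members.
        for k in range(len(centroids)):
            total = Counter()
            for v, lab in zip(vectors, labels):
                if lab == k:
                    for t, w in v.items():
                        total[t] += w
            if total:
                centroids[k] = normalize(total)
    return labels


# Toy corpus (made up for illustration).
docs = [
    "cluster document vector".split(),
    "document cluster distance".split(),
    "ontology concept relation".split(),
    "concept ontology relationship".split(),
]
vectors = tfidf_vectors(docs)
labels = spherical_kmeans(vectors, seeds=[0, 2])
print(labels)
```

Because the vectors are unit length, cosine similarity and Euclidean distance rank document pairs identically (the squared distance equals 2 minus twice the cosine), which is why Spherical K-Means can work with the dot product alone. A different similarity measure such as HSMTP would replace the `cosine` call in the assignment step.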


Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been done on clustering English documents. This raises the question of whether those results extend to other languages. Since the goal of document clustering is grouping of docum...


Partitioning-based clustering for Web document categorization

Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of documen...


Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system that segments and classifies the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction, and document classification. In preprocessing, a document image is enhanced by noise removal, binarization, and skew detection and correction. In document segmentation, an algorith...


A New Approach to Classify Text based on CosFuzzy Logic

Evaluating objective-type examinations by computer is easy, but evaluating descriptive-type questions is more difficult, and no significant research has been done on it. In this paper I propose a new solution to this problem with text classification using a new fuzzy logic named CosFuzzy Logic. Document clustering is a useful technique that organizes a large quanti...




Publication date: 2015